Leveraging Category-based LSI for Patent Retrieval

Author

  • Masaki Aono
Abstract

Latent Semantic Indexing (LSI) has been used to reduce the dimensionality of document indices for similarity search. In this paper, we describe a method for retrieving conceptually similar patents by first categorizing the patent collection and then applying the LSI algorithm separately to each category. The main strategy is to keep the algorithm as simple as possible while achieving scalability for massive datasets. During the categorization phase, any patent may be classified into multiple categories, so patent documents can overlap across categories. Then, for each category, we apply LSI to reduce its documents to a much lower dimension. Finally, given a query consisting of the claim sentences of a patent, we select the most similar category and return the top fifty ranked patent documents as candidates for invalidating the query document.
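
The pipeline described in the abstract can be illustrated with a small sketch. This is not the author's implementation: the whitespace tokenizer, raw term counts, the rank k=2, the toy patents, the choice of centroid cosine similarity for picking the "most similar category", and all function names are assumptions made for illustration only.

```python
import numpy as np

def build_vocab(docs):
    # Assumption: naive lowercase whitespace tokenization.
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def term_doc_matrix(docs, vocab):
    # Rows are terms, columns are documents, entries are raw counts.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            if w in vocab:
                A[vocab[w], j] += 1.0
    return A

def fit_lsi(A, k):
    # Truncated SVD: A ~ U_k S_k V_k^T; row j of V_k represents document j.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, int(np.sum(s > 1e-10)))  # never keep zero singular values
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(q, U, s):
    # Standard LSI query folding: q_hat = q^T U_k S_k^{-1}.
    return (q @ U) / s

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def build_index(patents, categories, k=2):
    # One independent LSI space per category; a patent id may appear
    # in several categories, so categories can overlap.
    index = {}
    for name, ids in categories.items():
        docs = [patents[i] for i in ids]
        vocab = build_vocab(docs)
        A = term_doc_matrix(docs, vocab)
        U, s, V = fit_lsi(A, k)
        index[name] = {"ids": ids, "vocab": vocab, "U": U, "s": s,
                       "V": V, "centroid": A.mean(axis=1)}
    return index

def search(query, index, top_n=50):
    # Step 1 (assumed criterion): pick the category whose term centroid
    # is closest to the query in raw term space.
    best_name, best_sim = None, -1.0
    for name, cat in index.items():
        q = term_doc_matrix([query], cat["vocab"])[:, 0]
        sim = cosine(q, cat["centroid"])
        if sim > best_sim:
            best_name, best_sim = name, sim
    # Step 2: rank that category's patents in its reduced LSI space.
    cat = index[best_name]
    q = term_doc_matrix([query], cat["vocab"])[:, 0]
    q_hat = fold_in(q, cat["U"], cat["s"])
    scored = [(cosine(q_hat, cat["V"][j]), pid)
              for j, pid in enumerate(cat["ids"])]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return best_name, [pid for _, pid in scored[:top_n]]

# Toy data (hypothetical); P5 is deliberately placed in both categories.
patents = {
    "P1": "battery electrode lithium cell",
    "P2": "lithium battery charging circuit",
    "P3": "engine piston cylinder fuel",
    "P4": "fuel injection engine valve",
    "P5": "battery powered engine hybrid vehicle",
}
categories = {
    "electrochemistry": ["P1", "P2", "P5"],
    "engines": ["P3", "P4", "P5"],
}
index = build_index(patents, categories, k=2)
query = "lithium battery electrode"  # stands in for a patent's claim text
category, hits = search(query, index, top_n=50)
print(category, hits)
```

Because each category has its own small term-document matrix, each SVD stays cheap even when the full collection is large, which matches the scalability goal stated in the abstract. A real system would likely use weighted terms and a sparse truncated SVD rather than the dense decomposition shown here.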

Similar resources

Latent Semantic Indexing for Patent Documents

Since the huge database of patent documents is continuously increasing, the issue of classifying, updating and retrieving patent documents turned into an acute necessity. Therefore, we investigate the efficiency of applying Latent Semantic Indexing, an automatic indexing method of information retrieval, to some classes of patent documents from the United States Patent Classification System. We ...

Invalidity Patent Search System of NTT DATA

In this paper, we give an overview of our invalidity patent search system for NTCIR-4 PATENT. The system is based on the document retrieval technique and the new methods that are suitable for the invalidity search; the query term extraction based on characteristics of invention, the retrieval model using components of invention, the ranking using the term weighting based on category information...

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

The task of Text Classification (TC) is to automatically assign natural language texts with thematic categories from a predefined category set. And Latent Semantic Indexing (LSI) is a well known technique in Information Retrieval, especially in dealing with polysemy (one word can have different meanings) and synonymy (different words are used to describe the same concept), but it is not an opti...

Performance Evaluation of Medical Image Retrieval Systems Based on a Systematic Review of the Current Literature

Background and Aim: Image, as a kind of information vehicle which can convey a large volume of information, is important especially in medicine field. Existence of different attributes of image features and various search algorithms in medical image retrieval systems and lack of an authority to evaluate the quality of retrieval systems, make a systematic review in medical image retrieval system...

A Content Vector Model for Text Classification

As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications. In this paper, an LSI-based content vector model for text classification is presented, which constructs multiple augmented category LSI spaces and classifies text by their content. The model integrates the class discriminative information from the traini...

Publication date: 2007